ABSTRACT

Extensions to various information theoretic quantities used for nonlinear time
series analysis are discussed, as well as their relationship to the generalized
correlation integral. It is shown that calculating redundancies from the correlation
integral can be more accurate and more efficient than direct box counting methods.
It is also demonstrated that many commonly used nonlinear statistics have
information theory based analogues. Furthermore, the relationship between the
correlation integral and information theoretic statistics allows us to define "local"
versions of many information theory based statistics, including a local version of
the Kolmogorov-Sinai entropy, which gives an estimate of the local predictability.

1 Introduction 

The idea of viewing a chaotic dynamical system as an information source was first suggested
by Shaw [1]. Since that time many authors have proposed methods to characterize strange
attractors based on information theoretic quantities. These include information dimension
[2,3], various measures of information production rate like the Kolmogorov-Sinai entropy
[1,4], as well as its generalizations [5-7] based on the Renyi entropies [8], and information
based measures of dependence, such as mutual information [1,9] and redundancies [10,11].

Information theoretic measures are only one of many tools which are available for
characterizing nonlinear systems. One class of methods, which includes many of the
information based statistics, relies only on the invariant measure of the attractor; this
includes various measures of dimension [3,7,12-17] and the statistics [18-23] based on the
correlation integral of Grassberger and Procaccia [13]. There are also methods which use
dynamical information directly, such as the Lyapunov exponents [24-26], Kolmogorov-Sinai
(K-S) entropy, nonlinear prediction error [27-31] and less direct measures of determinism
[32-35]. Many of the methods based on the invariant measure can also measure dynamical
properties if a delay coordinate embedding is used; however, these methods are not
fundamentally dynamical in the way that the Lyapunov exponents, K-S entropy and prediction
error are. There is also a growing class of measures which depend on the topological
properties of the attractor [36-39].

Many of these methods are related; for example, the Lyapunov exponents are related to
the Kolmogorov-Sinai entropy through the Pesin identity [4], and the information dimension
and Lyapunov exponents seem to be related through the Kaplan-Yorke conjecture [40].
Further, properties of the unstable periodic orbits, which form the basis of many of the
topological methods, can also be used to estimate dimension and entropies [41-45] (see
footnote 2).

The relationship between the Shannon [2] and Renyi entropies [8] and the generalized
correlation integral [7] has been pointed out by a number of authors [46-48]. It is
demonstrated that by using this relationship, more accurate estimates of many information
theoretic statistics can be obtained, as compared to conventional box-counting methods. It is
also shown that there are advantages to using statistics based on the generalized entropies of
Renyi instead of the Shannon entropy. Furthermore, information theory based analogues to
a number of statistics [18,21,22] are proposed. Using the relationship between the Shannon
entropy and the generalized correlation integral, "local" versions of many information based
statistics are proposed. These include measures of the local coupling between variables, as
well as a localized version of the Kolmogorov-Sinai (K-S) entropy, which is related to the
local predictability in much the same way as the local Lyapunov exponents [49-51].

2 Entropy, mutual information and redundancies 

In this section the definitions of various statistics based on information theory are reviewed.
One of the most basic statistics is the Shannon entropy [2], which quantifies the average
information gained from a measurement. This is usually estimated from a time series by a
box counting approach; that is, a partition size $\delta$ is chosen, and the data $x$ (we use $x$ as
a convenient abbreviation for $x(t)$, $t = 1, \ldots, N$) are discretized into integers $y = 1, \ldots, M$
depending on which bin of size $\delta$ they fall into. In this case, the Shannon entropy is given by:

$$H_1(x,\delta) = H_1(y) = -\sum_{y=1}^{M} p(y)\log_2[p(y)] \tag{1}$$

where $p(y)$ is the probability of being in the $y$th bin. The actual value of the entropy
estimated by this method depends on the partition size $\delta$. For two time series $x_1$ and $x_2$ the
joint entropy is given by:

$$H_1(x_1,x_2,\delta) = H_1(y_1,y_2) = -\sum_{y_1}\sum_{y_2} p(y_1,y_2)\log_2[p(y_1,y_2)] \tag{2}$$

and for m variables the entropy (often called block entropy) is: 

$$H_1(x_1,\ldots,x_m,\delta) = H_1(y_1,\ldots,y_m) = -\sum_{y_1,\ldots,y_m} p(y_1,\ldots,y_m)\log_2[p(y_1,\ldots,y_m)] \tag{3}$$
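As a rough illustration of the box counting estimate of the Shannon, joint and block entropies, the following Python sketch discretizes the data and counts box occupancies. The function name `box_entropy` and the choice of aligning the boxes at zero are assumptions made for this sketch; they are not taken from the original text.

```python
import math
from collections import Counter


def box_entropy(columns, delta):
    """Box counting estimate, in bits, of the Shannon (block) entropy.

    columns: list of equally long sequences, one per variable (hypothetical layout).
    delta:   partition (box) size.
    """
    n = len(columns[0])
    # Replace every sample by the integer label of the box it falls into,
    # then count how many samples share each (joint) label.
    counts = Counter(
        tuple(math.floor(col[t] / delta) for col in columns) for t in range(n)
    )
    # p(y) = count / n; sum -p log2 p over the occupied boxes only.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


x1 = [0.1, 0.2, 1.1, 1.2]
x2 = [0.1, 1.1, 0.2, 1.2]
print(box_entropy([x1], 1.0))      # two equally likely boxes: 1 bit
print(box_entropy([x1, x2], 1.0))  # four equally likely joint boxes: 2 bits
```

Passing a single series gives the single-variable entropy, and passing several series gives the joint or block entropy, so the same routine covers all three cases.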

One can also define an entropy for continuous variables: 

$$H_1(x) = -\int p(x)\log_2[p(x)]\,dx \tag{4}$$

where $p(x)$ is the probability density of $x$. However, this definition has some unusual
properties; for one, it depends on the coordinate system which is used. For example, if
$z = ax$ then $H_1(z) = H_1(x) + \log_2(a)$; also $\lim_{\delta\to 0} H_1(x,\delta) \neq H_1(x)$. The fact that
the entropy depends on the partition size allows one to define an information dimension [3],
based on the average scaling of the amount of information required to specify a point in the
state space within an accuracy of $\delta$. The information dimension is given by:

$$D_1 = \lim_{\delta\to 0}\frac{H_1(x,\delta)}{\log_2(1/\delta)} \tag{5}$$

Footnote 2: It should be pointed out that these methods do not use the linking of the orbits,
as the topological methods do.

The average amount of information that $x_1$ contains about $x_2$ can be expressed by the
mutual information:

$$I_1(x_1;x_2,\delta) = H_1(x_1,\delta) + H_1(x_2,\delta) - H_1(x_1,x_2,\delta) \tag{6}$$

The mutual information is a measure of how many bits one can predict about $x_2$ given a
measurement of $x_1$ with an accuracy of $\delta$. If $x_1$ and $x_2$ are independent then the mutual
information is zero, while if $x_2$ is completely dependent on $x_1$ then $I_1(x_1;x_2) = H_1(x_2)$. Note
also that $I_1(x_1;x_2) = I_1(x_2;x_1)$ and $I_1(x_1;x_1) = H_1(x_1)$. The mutual information for continuous
variables is coordinate independent, unlike the entropy, and assuming that there is a small
amount of noise in the data, $\lim_{\delta\to 0} I_1(x_1;x_2,\delta) = I_1(x_1;x_2)$. This is because for $\delta$ smaller
than the noise scale, both $H_1(x_1,\delta)$ and $H_1(x_2,\delta)$ will scale as $-\log\delta$, while $H_1(x_1,x_2,\delta)$ will
scale as $-2\log\delta$. For a noiseless deterministic system one can use the scaling of the mutual
information with $\delta$ to define a "mutual information dimension" [52].

The $m$ dimensional extension of mutual information is called redundancy [10],

$$R_1(x_1;\ldots;x_m,\delta) = \sum_{i=1}^{m} H_1(x_i,\delta) - H_1(x_1,\ldots,x_m,\delta) \tag{7}$$

where $\mathbf{x}(t) = (x_1(t), x_2(t), \ldots, x_m(t))$ can be either a multivariate signal or a time delay
embedding [53] $\mathbf{x}(t) = (x(t), x(t-\tau), \ldots, x(t-(m-1)\tau))$; for delay coordinates we have:

$$R_1(x_1;\ldots;x_m,\delta) = m H_1(x_1,\delta) - H_1(x_1,\ldots,x_m,\delta) \tag{8}$$

To quantify the amount of information about $x_m$ contained in $x_1, x_2, \ldots, x_{m-1}$ a quantity
called marginal redundancy ($R'_1$) is used [10]:

$$R'_1(x_1,\ldots,x_{m-1};x_m,\delta) = R_1(x_1;\ldots;x_m,\delta) - R_1(x_1;\ldots;x_{m-1},\delta) \tag{9}$$

If $x_m$ is independent of $x_1, x_2, \ldots, x_{m-1}$ then the marginal redundancy is zero, while if $x_m$ is
completely dependent on $x_1, x_2, \ldots, x_{m-1}$ then $R'_1(x_1,\ldots,x_{m-1};x_m) = H_1(x_m)$.
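The redundancy and marginal redundancy for delay coordinates can be sketched on top of a box counting block entropy. This is a minimal illustration under stated assumptions: the function names are hypothetical, the boxes are aligned at zero, and the single-variable entropy uses all $N$ samples while the block entropy uses the $N-(m-1)\tau$ available delay vectors (a difference that is negligible when $N \gg m\tau$).

```python
import math
from collections import Counter


def block_entropy(x, m, tau, delta):
    """H_1(x_1,...,x_m, delta) in bits for the delay vectors
    (x(t), x(t - tau), ..., x(t - (m-1)tau)), by box counting."""
    start = (m - 1) * tau
    counts = Counter(
        tuple(math.floor(x[t - k * tau] / delta) for k in range(m))
        for t in range(start, len(x))
    )
    n = len(x) - start
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def redundancy(x, m, tau, delta):
    """Redundancy for delay coordinates: m*H_1(x_1) - H_1(x_1,...,x_m)."""
    return m * block_entropy(x, 1, tau, delta) - block_entropy(x, m, tau, delta)


def marginal_redundancy(x, m, tau, delta):
    """Marginal redundancy: R_1 in m dimensions minus R_1 in m-1 dimensions."""
    return redundancy(x, m, tau, delta) - redundancy(x, m - 1, tau, delta)


# A period-2 signal: x_m is completely determined by x_{m-1},
# so the marginal redundancy should be close to H_1(x_m) = 1 bit.
x = [0.0, 1.0] * 50
print(block_entropy(x, 1, 1, 1.0), marginal_redundancy(x, 2, 1, 1.0))
```

For an independent sequence the same call would instead return a value near zero, up to finite-sample bias.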

The redundancies and marginal redundancies, like the mutual information, are only scale
independent for noisy systems. However the quantity

$$\lim_{m\to\infty}\left[H_1(x_1,\ldots,x_m,\delta) - H_1(x_1,\ldots,x_{m-1},\delta)\right] \tag{10}$$

has the opposite behavior, that is, for a deterministic system it is scale independent (both
terms scale as $-D_1\log\delta$ for small $\delta$, so the overall expression does not depend on $\delta$), while
for a noisy system it scales as $-\log\delta$. A related quantity is the Kolmogorov-Sinai (K-S)
entropy, which is a measure of the mean rate of information creation by the system. Given a
time delay embedding $\mathbf{x}(t) = (x_1(t), x_2(t), \ldots, x_m(t)) = (x(t), x(t-\tau), \ldots, x(t-(m-1)\tau))$,
the K-S entropy is:

$$K_1 = \lim_{\delta\to 0}\lim_{m\to\infty}\left[H_1(x_m|x_1,\ldots,x_{m-1},\delta)\right]/\tau = \lim_{\delta\to 0}\lim_{m\to\infty}\left[H_1(x_1,\ldots,x_m,\delta) - H_1(x_1,\ldots,x_{m-1},\delta)\right]/\tau \tag{11}$$

The K-S entropy can also be related to the marginal redundancy [10]; for small delay times
$\tau$ we have,

$$\lim_{m\to\infty} R'_1(\tau) = H_1(x_1) - \tau K_1 \tag{12}$$

This suggests another way to estimate the entropy is:

$$K_1 = \lim_{m\to\infty}\frac{R'_1(\tau_1) - R'_1(\tau_2)}{\tau_2 - \tau_1} \tag{13}$$

3 Linear redundancies 

Palus et al. [11] define "linear redundancies" which are derived from the continuous case of
the above formulas for the special case of a multivariate gaussian distribution

$$p(\mathbf{x}) = \frac{|\xi|^{1/2}}{(2\pi)^{m/2}}\exp\Big(-\frac{1}{2}\sum_{i,j}\xi_{ij}x_ix_j\Big) \tag{14}$$

where $|\cdot|$ is the determinant, and $\xi_{ij}$ is an element of the matrix $\xi$, which is the inverse of
the covariance matrix $\Sigma$, with elements given by $\Sigma_{ij} = \langle (x_i(t) - \langle x_i(t)\rangle)(x_j(t) - \langle x_j(t)\rangle)\rangle$.
Combining Eq. (14) and Eq. (4) we find that the "linearized" entropy is given by:



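$$\mathcal{H}_1(\mathbf{x}) = \log_2\Big(\frac{(2\pi)^{m/2}}{|\xi|^{1/2}}\Big) + \frac{1}{2}\log_2(e)\int\Big[\sum_{i,j}\xi_{ij}x_ix_j\Big]p(\mathbf{x})\,d\mathbf{x} = \frac{m}{2}\log_2(2\pi e) + \frac{1}{2}\log_2|\Sigma| \tag{15}$$

and the linearized redundancy (footnote 3) is:

$$\mathcal{R}(x_1;\ldots;x_m) = \frac{1}{2}\sum_{i=1}^{m}\log_2(\Sigma_{ii}) - \frac{1}{2}\log_2|\Sigma| \tag{16}$$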

This is equivalent to the form given in Palus et al. [11] since for a symmetric matrix
$|\Sigma| = \prod_{j=1}^{m}\lambda_j$, where $\lambda_j$ are the eigenvalues of $\Sigma$. Palus et al. also define what they
call a marginal linear redundancy by:

$$\mathcal{R}'(x_1,x_2,\ldots,x_{m-1};x_m) = \mathcal{R}(x_1;x_2;\ldots;x_m) - \mathcal{R}(x_1;x_2;\ldots;x_{m-1}) \tag{17}$$

Computing the linear redundancies provides a way of assessing the role of linear
correlations in the estimate of the actual information-theoretic quantity. If the redundancy
and linear redundancy are substantially different, then there is substantial nonlinearity in
the time series. Thus, one has a qualitative test for nonlinearity. In some cases it can be
advantageous to transform the original data to have a gaussian distribution, so that a
nongaussian distribution is not mistaken for nonlinearity. Palus [54] has also proposed combining
this test with the method of surrogate data [55]. One computes the redundancies and their
linearized versions for the original data set, as well as for an ensemble of surrogate data
sets which are generated to match the linear properties (the power spectrum) of the original
data set. If the redundancies for the original data are significantly different from the values
for the surrogates, then one can formally reject the null hypothesis that the data arise from
a linear process. It is also possible to obtain a quantitative statement about the confidence
level of the evidence for nonlinearity. Thus, the comparisons with linear surrogate data and
the comparisons with a linear statistic provide complementary information about the possible
nonlinearity in the time series. Further, as pointed out by Palus, comparing the linear
redundancies calculated from the original and surrogate data sets gives a good way to check
that the surrogate data sets really are reproducing the linear properties of the original
data.

Footnote 3: One can also define nonlinear statistics based on "local" versions of the "linearized" statistics.
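Since the linear redundancy depends only on the covariance matrix, it can be computed directly from second moments. The sketch below uses only the Python standard library; the helper names and the hand-written determinant routine are assumptions of this example, and `math.log2` will raise an error if the covariance matrix is singular.

```python
import math


def covariance_matrix(columns):
    """Sample covariance matrix of a list of equally long series."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [
        [sum((a[t] - ma) * (b[t] - mb) for t in range(n)) / n
         for b, mb in zip(columns, means)]
        for a, ma in zip(columns, means)
    ]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    size = len(a)
    det = 1.0
    for i in range(size):
        pivot = max(range(i, size), key=lambda k: abs(a[k][i]))
        if a[pivot][i] == 0.0:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            det = -det  # a row swap flips the sign
        det *= a[i][i]
        for k in range(i + 1, size):
            f = a[k][i] / a[i][i]
            for j in range(i, size):
                a[k][j] -= f * a[i][j]
    return det


def linear_redundancy(columns):
    """0.5 * sum_i log2(S_ii) - 0.5 * log2|S|, in bits."""
    s = covariance_matrix(columns)
    diag = sum(math.log2(s[i][i]) for i in range(len(s)))
    return 0.5 * diag - 0.5 * math.log2(determinant(s))


c = [1.0, 2.0, 3.0, 4.0, 5.0]
d = [2.0, 1.0, 4.0, 3.0, 6.0]
print(linear_redundancy([c, d]))
```

For two variables this reduces to $-\frac{1}{2}\log_2(1-\rho^2)$, with $\rho$ the ordinary correlation coefficient, which gives a quick sanity check on any implementation.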

4 Relation to $C_1$

The most straightforward way to estimate the quantities defined above is to use a box
counting approach: the $m$ dimensional space is divided into a number of boxes of size $\delta$. By
counting the number of points $n_i$ in the $i$th box, the probability can be estimated as $p_i \approx n_i/N$,
where $N$ is the total number of points (footnote 4). A number of authors have used refinements
to this procedure, by adapting the size of the boxes depending on the local density [9-11].

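It has been shown by Liebert and Schuster [46] that $H_1(\mathbf{x},r)$ can be related to $C_1(\mathbf{x},r)$,
the generalized correlation integral of order 1. Instead of estimating probabilities $p(\mathbf{x},\delta)$
within boxes of size $\delta$, one can calculate probabilities $P(\mathbf{x},r)$ in regions of radius $r$ about
each point. The two are related by:

$$\sum_{i\,(\mathrm{bins})} p_i(\mathbf{x},\delta)\log_2[p_i(\mathbf{x},\delta)] = \frac{1}{N}\sum_{t\,(\mathrm{datapoints})}\log_2[p_{i(t)}(\mathbf{x},\delta)] \approx \frac{1}{N}\sum_{t\,(\mathrm{datapoints})}\log_2[P_t(\mathbf{x},r)] = \log_2 C_1(\mathbf{x},r) \tag{18}$$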

where $p_i(\mathbf{x},\delta)$ is the probability of being in the $i$th bin (a box of diameter $\delta$), $i(t)$ is the
box that the $t$th data point is in, and $P_t(\mathbf{x},r)$ is the probability of being in the box of radius
$r$ (or diameter $\delta = 2r$) centered at the point $\mathbf{x}(t)$. Since the two boxes are the same size and
very close to each other (they overlap), we can heuristically justify the relation $p_{i(t)}(\delta) \approx P_t(r)$
for $\delta = 2r$ (using maximum norm; see footnote 5).

A natural estimate (footnote 6) of $P_t(r)$ is given by $B(\mathbf{x}(t),r)$, which is the fraction of data points

Footnote 4: For small $n_i$, Grassberger [56] has derived a correction to the formula, which for the Shannon entropy
is pi log 2 [pi] ~ ^log 2 (-AT) — ^(rii) — ^ n . 1 ^ 1 ' j where ^(e) is the digamma function (see also Wolpert and 
Wolf [57]). 

Footnote 5: For maximum norm, a radius $r$ corresponds to a box size (diameter) $\delta = 2r$. For the euclidean norm, in
an $m$ dimensional space, a sphere of radius $r$ has the same volume as a cube of diameter $\delta = c_m^{1/m} r$, where
$c_m$ is the volume of an $m$-dimensional unit sphere.

Footnote 6: Grassberger [56] has also derived a small $n_t$ correction to this formula, $\log_2 P_t \approx$
(excluding $\mathbf{x}(t)$ itself) within $r$ of $\mathbf{x}(t)$. That is,

$$P_t(\mathbf{x},r) \approx B(\mathbf{x}(t),r) = \frac{1}{N-1}\sum_{s\neq t}\Theta(r - \|\mathbf{x}(t)-\mathbf{x}(s)\|) = \frac{n_t}{N-1} \tag{19}$$

where $\Theta$ is the Heaviside function, $n_t$ is the number of points within a radius $r$ of $\mathbf{x}(t)$,
and $\|\cdot\|$ is some measure of distance (we use maximum norm). The generalized correlation
integral of order 1 is a (geometric) average of the $B(\mathbf{x}(t),r)$ probabilities, and is given by

$$\log C_1(\mathbf{x},r) = \frac{1}{N}\sum_{t}\log B(\mathbf{x}(t),r) \tag{20}$$

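Note that Eq. (19) is a simple form of kernel density estimation [58]; often this would be
written in the form:

$$B(\mathbf{x}(t),r) = \frac{1}{N-1}\sum_{s\neq t} K\Big(\frac{\|\mathbf{x}(t)-\mathbf{x}(s)\|}{r}\Big) \tag{21}$$

where $K(z) = 1$ if $z < 1$ and otherwise is zero. This particular kernel is far from optimal (in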
fact, Silverman [58] calls it the "naive estimator"). However, even this crude form of kernel 
density estimation is generally considered superior to using a multidimensional histogram 
(binning). Better results can often be obtained, if one uses a kernel K(z) which decreases 
with increasing z or even a kernel whose width depends on the local density (see Ref. [58] 
for details). However, in this paper Eq. (19) will be used, so we can estimate the Shannon 
entropy from the generalized correlation integral of order one [7, 16,46] 

$$H_1(x_1,x_2,\ldots,x_m,\delta) \approx -\log_2 C_1(x_1,x_2,\ldots,x_m,r) \tag{22}$$

We can now express the redundancies (Eqs. 7 and 9) and K-S entropy (Eqs. 11 and 13) in
terms of $C_1$, and since kernel density estimation is being used instead of binning, we can
expect more accurate results with limited data sets.
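A minimal Python sketch of this estimate follows: it computes the neighbour fraction around every point (maximum norm) and takes minus the mean of their base-2 logarithms. The function names, the $1/(N-1)$ normalization and the strict inequality in the distance test are assumptions of this sketch, and no small-sample correction is applied.

```python
import math


def neighbour_fractions(points, r):
    """For every point, the fraction of the *other* points that lie
    within distance r of it, using the maximum norm."""
    n = len(points)
    fractions = []
    for t in range(n):
        count = sum(
            1
            for s in range(n)
            if s != t
            and max(abs(a - b) for a, b in zip(points[t], points[s])) < r
        )
        fractions.append(count / (n - 1))
    return fractions


def shannon_entropy_from_c1(points, r):
    """H_1 ~ -log2 C_1, where log2 C_1 is the mean of log2 of the
    neighbour fractions (a geometric average)."""
    b = neighbour_fractions(points, r)
    if min(b) == 0.0:
        # A point with no neighbours makes the geometric mean vanish.
        raise ValueError("some point has no neighbours within r; increase r")
    return -sum(math.log2(v) for v in b) / len(b)


# Two well separated pairs of points: each point sees 1 of the 3 others.
pts = [(0.0,), (0.1,), (1.0,), (1.1,)]
print(shannon_entropy_from_c1(pts, 0.5))  # log2(3) bits
```

The double loop costs $O(N^2)$ operations; it is meant only to make the definition concrete, not to compete with fast neighbour-search algorithms.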

5 Generalized entropies and redundancies 

Instead of using the Shannon entropy to calculate redundancies, we can generalize these
statistics by using the Renyi entropies [8]:

$$H_q(\mathbf{x},\delta) = \frac{1}{1-q}\log_2\sum_i [p_i(\mathbf{x},\delta)]^q \tag{23}$$

It is easy to show that the limit as $q \to 1$ leads to the Shannon entropy. Again we can
relate probabilities $p(\mathbf{x},\delta)$ within boxes of size $\delta$ to probabilities $P(\mathbf{x},r)$ in regions of radius
$r = \delta/2$ (for maximum norm) about each point:

$$\sum_i [p_i(\mathbf{x},\delta)]^q \approx \frac{1}{N}\sum_t [P_t(\mathbf{x},r)]^{q-1} \tag{24}$$

(^&(n t + 1) — \og 2 (N) — j , which is used in the calculations below. 
so we have (footnote 7),

$$H_q(\mathbf{x},\delta) \approx \frac{1}{1-q}\log_2\Big\{\frac{1}{N}\sum_t\Big[\frac{1}{N-1}\sum_{s\neq t}\Theta(r - \|\mathbf{x}(t)-\mathbf{x}(s)\|)\Big]^{q-1}\Big\} = -\log_2[C_q(\mathbf{x},r)] \tag{25}$$



where $C_q(\mathbf{x},r)$ is the generalized correlation integral [7,16]. The idea of relating the Renyi
entropies to the generalized correlation integral is by no means new. It was mentioned in the
review by Grassberger et al. [47], the $q = 1$ case was used by Liebert and Schuster [46] and
Pompe [48] used the correlation integral to calculate $H_2$. However, by using Eq. (25) we can
define generalized redundancies $R_q(\mathbf{x},r)$ and $R'_q(\mathbf{x},r)$ in terms of $C_q(\mathbf{x},r)$. Pompe [48] has
also proposed what he calls a generalized mutual information which he expresses in terms
of the second order ($q = 2$) correlation integral. Pompe's generalized mutual information is
given by:

$$I_2(x_1,\ldots,x_{m-1};x_m) = H_2(x_m) - H_2(x_1,\ldots,x_m) + H_2(x_1,\ldots,x_{m-1}) \tag{26}$$

which is the same as the second order ($q = 2$) generalized marginal redundancy.

While $q = 1$ leads to the natural definition of the entropy (footnote 8), there are reasons to prefer
different values of $q$. For example, for all $q$ except $q = 2$ there are corrections to Eq. (24)
for small $r$. Grassberger has derived the asymptotic form of these corrections [56], but for
finite length sets the best statistics at small $r$ are obtained by using $q = 2$ (in contrast,
for box counting methods there are small $n_i$ corrections for all $q$ [56]). Another reason for
using $q = 2$ is that it is the fastest to compute of all the generalized correlation integrals.
Further, the correlation integral has a dynamic range of $O(N^2)$ as opposed to $O(N)$ for box
counting methods; this permits the use of smaller values of $r$ [17]. Finally, the speed of
box counting methods [59] is not a deciding factor, as there are numerous fast correlation
integral algorithms available [60-62], so one can compute the generalized redundancies for
small $r$ in $O(N \log N)$ time or faster.
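To make the correlation integral route concrete, here is a small Python sketch of the generalized redundancy and marginal redundancy for delay coordinates, defaulting to $q = 2$. All function names are hypothetical, the brute-force pair search is $O(N^2)$ rather than one of the fast algorithms cited above, and the $1/(N-1)$ normalization is an assumption of the sketch.

```python
import math


def delay_vectors(x, m, tau):
    """Delay vectors (x(t), x(t - tau), ..., x(t - (m-1)tau))."""
    return [
        tuple(x[t - k * tau] for k in range(m))
        for t in range((m - 1) * tau, len(x))
    ]


def correlation_integral(points, r, q=2):
    """Generalized correlation integral C_q(r), q != 1, maximum norm."""
    if q == 1:
        raise ValueError("q = 1 needs the geometric-mean form instead")
    n = len(points)
    total = 0.0
    for t in range(n):
        count = sum(
            1
            for s in range(n)
            if s != t
            and max(abs(a - b) for a, b in zip(points[t], points[s])) < r
        )
        total += (count / (n - 1)) ** (q - 1)
    return (total / n) ** (1.0 / (q - 1))


def renyi_entropy(x, m, tau, r, q=2):
    """H_q ~ -log2 C_q for the m-dimensional delay embedding."""
    return -math.log2(correlation_integral(delay_vectors(x, m, tau), r, q))


def generalized_redundancy(x, m, tau, r, q=2):
    """R_q = m*H_q(x_1) - H_q(x_1,...,x_m) for delay coordinates."""
    return m * renyi_entropy(x, 1, tau, r, q) - renyi_entropy(x, m, tau, r, q)


def generalized_marginal_redundancy(x, m, tau, r, q=2):
    """R'_q = R_q in m dimensions minus R_q in m-1 dimensions."""
    return (generalized_redundancy(x, m, tau, r, q)
            - generalized_redundancy(x, m - 1, tau, r, q))


x = [0.0, 1.0] * 10  # period-2 test signal
print(generalized_marginal_redundancy(x, 2, 1, 0.5))
```

For $q = 2$ the inner quantity is just the fraction of close pairs, which is why that case is the cheapest to evaluate and never requires a zero count to be raised to a negative power.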

5.1 Generalized linear redundancies 

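We can also define linear versions of the generalized entropies,

$$\mathcal{H}_q(\mathbf{x}) = \frac{1}{1-q}\log_2\int [p(\mathbf{x})]^q\,d\mathbf{x} \tag{27}$$

using $|q\xi_{ij}| = q^m|\xi_{ij}|$ and the normalization condition of the gaussian we find:

$$\mathcal{H}_q(\mathbf{x}) = \frac{1}{1-q}\log_2\Big[\frac{|\xi_{ij}|^{(q-1)/2}}{q^{m/2}(2\pi)^{(q-1)m/2}}\Big] = \frac{m}{2}\log_2(2\pi) + \frac{1}{2}\log_2|\Sigma_{ij}| + \frac{m\log_2(q)}{2(q-1)} \tag{28}$$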
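The step leading to Eq. (28) is a standard Gaussian integral. Written out for the zero-mean gaussian of Eq. (14), and using $|\Sigma| = 1/|\xi|$, a sketch of the calculation is:

```latex
\begin{align*}
\int [p(\mathbf{x})]^q \, d\mathbf{x}
  &= \frac{|\xi|^{q/2}}{(2\pi)^{qm/2}}
     \int \exp\Big(-\frac{1}{2}\sum_{i,j} q\,\xi_{ij} x_i x_j\Big)\, d\mathbf{x}
   = \frac{|\xi|^{q/2}}{(2\pi)^{qm/2}} \cdot \frac{(2\pi)^{m/2}}{|q\xi|^{1/2}} \\
  &= \frac{|\xi|^{(q-1)/2}}{q^{m/2}\,(2\pi)^{(q-1)m/2}}, \\
\frac{1}{1-q}\log_2 \int [p(\mathbf{x})]^q \, d\mathbf{x}
  &= \frac{m}{2}\log_2(2\pi) + \frac{1}{2}\log_2|\Sigma|
     + \frac{m\,\log_2 q}{2(q-1)} .
\end{align*}
```

As $q \to 1$ the last term tends to $(m/2)\log_2 e$, so the expression reduces to the linearized Shannon entropy of Eq. (15).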



Footnote 7: One can also express the probability in $m$ dimensions as $P = k/(c_m r_m(k)^m)$, where $r_m(k)$ is the distance to
the $k$th nearest neighbor, and $c_m$ is the volume of a "sphere" of radius $r$ (it depends both on the embedding
dimension and the distance norm), so it should be possible to make "fixed mass" versions of the generalized
redundancies as well.

Footnote 8: That is, $q = 1$ is the only one of the generalized entropies which is an additive quantity.
and for the generalized linear redundancies:

$$\mathcal{R}_q(x_1;\ldots;x_m) = \frac{1}{2}\sum_{i=1}^{m}\log_2(\Sigma_{ii}) - \frac{1}{2}\log_2|\Sigma| \tag{29}$$

which is the same as Eq. (16); that is, the linear redundancies do not depend on $q$, but the
"linear entropies" (Eq. (28)) do depend on $q$, as does the linearized version of the generalized
correlation integral

$$\mathcal{C}_q(\mathbf{x}) = \Big[\frac{|\xi_{ij}|^{(q-1)/2}}{q^{m/2}(2\pi)^{(q-1)m/2}}\Big]^{1/(q-1)} \tag{30}$$

Therefore, the idea of comparing a nonlinear statistic to its "linearized" version can be
extended to all the statistics based on the correlation integral (e.g. Refs. [5,18,19,21,22]).

6 Applications 

6.1 Clean computer generated data 

As the first test, the correlation integral based redundancy analysis is applied to 8192 points
from the chaotic Rossler equations [63] (with parameters $a = 0.15$, $b = 0.2$, $c = 10$ and a
sampling time of $\delta t = 0.314$; these are the same parameters used by Palus [64]). In Fig. 1 we
show the linear, $C_2$, and $C_1$ based redundancies and marginal redundancies as a function of
the time delay $\tau$, with $r = 0.1\sigma$, where $\sigma$ is the standard deviation of the data set. The lines
are for increasing embedding dimensions ($m = 2, \ldots, 8$) starting at the bottom of the graph.
Eq. (12) suggests that as $m$ is increased the marginal redundancy curves as a function of $\tau$
should accumulate to a line which has a slope equal to $-1$ times the $K_q$ entropy. From the
slope of the $C_2$ based marginal redundancies (Fig. 1b) it can be seen that the $K_2$ entropy is
roughly 0.03 bits/timestep (a similar value is found using the estimate of Eq. (11)). However,
a reliable estimate of the $K_1$ entropy cannot be obtained with this number of points (see
Fig. 1d) using either the method of Eq. (11) or Eq. (12) (footnote 9). That is, as suggested above, by
using the $C_2$ based redundancies we can either use smaller $r$ or fewer data points. In Fig. 1(e-f)
the linear redundancies and marginal redundancies are shown for comparison. Notice the
difference between the $\tau$ dependence of the redundancies and their linearized versions.

6.2 Real data: SFI-A 

This method is also applied to 8192 points from a chaotic laser experiment, which exhibits
Lorenz-type chaos with a correlation dimension of roughly 2.05 and a positive $K_2$ entropy [65].
This data set (A.cont) was part of the time series competition sponsored by the Santa Fe
Institute [66]. In Fig. 2 we show the results of the analysis for the linear, $C_1$, and $C_2$ based
redundancies as a function of the time delay $\tau$ and for embedding dimensions $m = 2, \ldots, 8$.
For the $C_2$ based redundancies we use $r = 0.1\sigma$; however, when using this value of $r$ for the $C_1$
based redundancies there are positive slopes at large $m$ and $\tau$ for the marginal redundancy,
as in Fig. 1d. The best results are obtained for $r = 0.25\sigma$, which is what is shown
in Fig. 2(c-d). The linear redundancies and marginal redundancies are shown in Fig. 2(e-f)
for comparison. The difference between the shapes of the redundancies and their linearized
versions clearly shows that there is nonlinearity in this data set. From the slopes of the $C_2$
based marginal redundancies it is seen that the $K_2$ entropy is roughly 0.025 bits/timestep
(we find $K_2 \approx 0.03$ bits/timestep from Eq. (11)). However, we are unable to get an estimate
of the $K_1$ entropy with this number of data points. Palus [64] also examines this data set
and gets very similar results to Fig. 2(c-d) using an adaptive box counting method.

Footnote 9: The results are slightly better at $r = 0.25\sigma$.

Fig. 1. Redundancy analysis for Rossler data set, for embedding dimensions $m = 2$ to $8$ and time delays
$\tau = 1$ to $60$ sample times. All curves are for $r = 0.1\sigma$ where $\sigma$ is the standard deviation of the time series.
(a) $C_2$ based normalized redundancies ($R_2(\tau)/(m-1)$). (b) $C_2$ based marginal redundancies ($R'_2(\tau)$).
(c) $C_1$ based normalized redundancies ($R_1(\tau)/(m-1)$). (d) $C_1$ based marginal redundancies ($R'_1(\tau)$).
(e) Normalized linear redundancies ($\mathcal{R}(\tau)/(m-1)$). (f) Linear marginal redundancies ($\mathcal{R}'(\tau)$).

Fig. 2. Redundancy analysis for SFI-A data set, for embedding dimensions $m = 2$ to $8$ and time
delays $\tau = 1$ to $60$ sample times. Curves (a) and (b) are for $r = 0.1\sigma$, curves (c) and (d) are for
$r = 0.25\sigma$, where $\sigma$ is the standard deviation of the time series. (a) $C_2$ based normalized redundancies
($R_2(\tau)/(m-1)$). (b) $C_2$ based marginal redundancies ($R'_2(\tau)$). (c) $C_1$ based normalized redundancies
($R_1(\tau)/(m-1)$). (d) $C_1$ based marginal redundancies ($R'_1(\tau)$). (e) Normalized linear redundancies
($\mathcal{R}(\tau)/(m-1)$). (f) Linear marginal redundancies ($\mathcal{R}'(\tau)$).

7 Relation to other statistics 

The connection between the entropy and the correlation integral allows us to relate other
correlation integral based statistics to their analogues from information theory. For example,
if the sequence is IID then $R_q$ will be zero; this is the idea behind the BDS test [18-20].
Putting $R_q$ in terms of the correlation integral (for delay coordinates) we find:

$$R_q(\mathbf{x},r) = \log_2(C_q(\mathbf{x},r)) - \log_2([C_q(x_1,r)]^m) \tag{31}$$

which is very similar to the BDS statistic

$$\mathrm{BDS}_q(\mathbf{x},r) = C_q(\mathbf{x},r) - [C_q(x_1,r)]^m \tag{32}$$

Green and Savit [22] have also proposed a statistic, based on the correlation integral, to
quantify the amount of additional information in $x_m$ about $x_1$ which is not a result of the
dependence of $x_1$ on $x_2, \ldots, x_{m-1}$. One measure of this is given by the conditional redundancy
(footnote 10):

$$R_q(x_1;x_m|x_2,\ldots,x_{m-1},r) = R'_q(x_1,\ldots,x_{m-1};x_m,r) - R'_q(x_2,\ldots,x_{m-1};x_m,r) = -\log_2\Big(\frac{C_q(x_1,\ldots,x_{m-1},r)\,C_q(x_2,\ldots,x_m,r)}{C_q(x_1,\ldots,x_m,r)\,C_q(x_2,\ldots,x_{m-1},r)}\Big) \tag{33}$$

For a time delay embedding this reduces to:

$$R_q(x_1;x_m|x_2,\ldots,x_{m-1},r) = -\log_2\Big(\frac{[C_q(x_1,\ldots,x_{m-1},r)]^2}{C_q(x_1,\ldots,x_m,r)\,C_q(x_1,\ldots,x_{m-2},r)}\Big) \tag{34}$$

The statistic proposed by Green and Savit is:

$$\delta_q(x_1;x_m,r) = 1 - \frac{C_q(x_1,\ldots,x_{m-1},r)\,C_q(x_2,\ldots,x_m,r)}{C_q(x_1,\ldots,x_m,r)\,C_q(x_2,\ldots,x_{m-1},r)} \tag{35}$$

which for time delays reduces to the statistic of Savit and Green [21]

$$\delta_q(x_1;x_m,r) = 1 - \frac{[C_q(x_1,\ldots,x_{m-1},r)]^2}{C_q(x_1,\ldots,x_m,r)\,C_q(x_1,\ldots,x_{m-2},r)} \tag{36}$$

We are not advocating the use of these information theory based statistics in place of the
BDS or Green and Savit statistics, but rather pointing out that there are information theory
based analogues to these statistics. An important distinction between these statistics and
statistics like the correlation dimension and the K-S entropy is that they are evaluated at a
fixed $r$ as opposed to taking the limit as $r \to 0$. Another measure of this type is the 'ApEn'
statistic advocated by Pincus [23], which is just the $K_2$ entropy [5,6] evaluated at a fixed $m$
and $r$.



Footnote 10: We are grateful to Milan Palus for pointing out that this difference of marginal redundancies can be
written as a conditional redundancy.
7.1 Cross-redundancies

Estimating the relations between multiple time series is an important problem. Recently
statistics have been proposed to estimate nonlinear correlations between variables [22,67].
One simple measure, which we call the cross-redundancy, is given by:

$$I_q(x_1;x_2,l,r) = H_q(x_1(t),r) + H_q(x_2(t+l),r) - H_q(x_1(t),x_2(t+l),r) \tag{37}$$

where $l$ is a lag time, as in a cross-correlation (a similar statistic has been used by Vastano
and Swinney [68] for measuring information transport in spatiotemporal systems). That is,
the cross-redundancy is just the mutual information between $x_1$ and a lagged value of $x_2$.
The cross-redundancy can also be expressed in terms of the correlation integral:

$$I_q(x_1;x_2,l,r) = -\log_2\Big(\frac{C_q(x_1(t),r)\,C_q(x_2(t+l),r)}{C_q(x_1(t),x_2(t+l),r)}\Big) \tag{38}$$

It can be seen that this is very similar to the quantity defined in Eq. (33) (the analog of the
statistic of Green and Savit), the important difference being the use of the lag time. As with
the other statistics we can define a linearized version of the cross-redundancy:

$$\mathcal{I}_q(x_1;x_2,l) = -\frac{1}{2}\log_2\big(1 - (\Sigma_{x_1x_2}(l))^2\big) \tag{39}$$

where it is assumed that both of the series $x_1$ and $x_2$ have zero mean and unit variance, and
$\Sigma_{x_1x_2}(l) = \langle x_1(t)x_2(t+l)\rangle$ is the cross-correlation function between $x_1$ and $x_2$ as a function
of the lag time $l$.

As an example, the second order ($q = 2$) cross-redundancy and its linearized version
are computed for the $x$ and $y$ components of the Lorenz equations [69] (with parameters
$\sigma = 10$, $\beta = 8/3$, and $r = 28$, and a sampling time of $\delta t = 0.04$). In Fig. 3 we show the
cross-redundancy (solid lines) for radii $r = 0.01\sigma, 0.02\sigma, \ldots, 0.1\sigma$ (here $\sigma$ is the standard
deviation of the data) as a function of lag time (the top curve is $r = 0.01\sigma$), and the linear
cross-redundancy (dashed curve). Notice that both the cross-redundancy and its linearized
version show a peak at a lag of roughly $-2$ time steps, but that the cross-redundancy also shows
(nonlinear) correlations at longer lags, which are not detected by the linear cross-redundancy.
That is, linearly $y$ looks just like a lagged version of $x$, but by using the cross-redundancy it is
seen that there is nonlinear coupling between $x$ and $y$.
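A comparison of this kind can be sketched as follows, for $q = 2$ and scalar series. The function names are hypothetical; unlike the text, which assumes series already scaled to zero mean and unit variance, the linear version here standardizes the lagged pairs itself, and both routines use only the samples for which $x_1(t)$ and $x_2(t+l)$ are available.

```python
import math


def c2(points, r):
    """Second order correlation integral: fraction of distinct ordered
    pairs of points closer than r in the maximum norm."""
    n = len(points)
    pairs = sum(
        1
        for t in range(n)
        for s in range(n)
        if s != t
        and max(abs(a - b) for a, b in zip(points[t], points[s])) < r
    )
    return pairs / (n * (n - 1))


def lagged_pairs(x1, x2, lag):
    """All pairs (x1(t), x2(t + lag)) for which both samples exist."""
    return [(x1[t], x2[t + lag]) for t in range(len(x1))
            if 0 <= t + lag < len(x2)]


def cross_redundancy(x1, x2, lag, r):
    """-log2( C2(x1) C2(x2 lagged) / C2(x1, x2 lagged) ), in bits."""
    pairs = lagged_pairs(x1, x2, lag)
    a = [(p[0],) for p in pairs]
    b = [(p[1],) for p in pairs]
    return -math.log2(c2(a, r) * c2(b, r) / c2(pairs, r))


def linear_cross_redundancy(x1, x2, lag):
    """-0.5 * log2(1 - rho^2), rho being the lagged cross-correlation."""
    pairs = lagged_pairs(x1, x2, lag)
    n = len(pairs)
    ma = sum(p[0] for p in pairs) / n
    mb = sum(p[1] for p in pairs) / n
    va = sum((p[0] - ma) ** 2 for p in pairs) / n
    vb = sum((p[1] - mb) ** 2 for p in pairs) / n
    cov = sum((p[0] - ma) * (p[1] - mb) for p in pairs) / n
    return -0.5 * math.log2(1.0 - cov * cov / (va * vb))


c = [1.0, 2.0, 3.0, 4.0, 5.0]
d = [2.0, 1.0, 4.0, 3.0, 6.0]
print(linear_cross_redundancy(c, d, 0))
```

Scanning `lag` over a range of positive and negative values and plotting both functions side by side would give the kind of comparison described for Fig. 3.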

7.2 Local information based measures 

The relationship between the correlation integral and the entropy makes it easy to define
"local" information theoretic measures (local in state space); for example, the local version
of the Shannon entropy is:

$$h_1(\mathbf{x}(t),r) = -\log_2\Big(\frac{1}{N-1}\sum_{s\neq t}\Theta(r - \|\mathbf{x}(t)-\mathbf{x}(s)\|)\Big) = -\log_2(B(\mathbf{x}(t),r)) \tag{40}$$

We can also define local redundancies, based on the inner sum of the correlation integral
($B(\mathbf{x}(t),r)$), as well as a local version of the K-S entropy:

$$k_1(\mathbf{x}(t)) = \lim_{r\to 0}\lim_{m\to\infty}\left[h_1(x_1(t),\ldots,x_m(t),r) - h_1(x_1(t),\ldots,x_{m-1}(t),r)\right]/\tau = \lim_{r\to 0}\lim_{m\to\infty}\frac{1}{\tau}\log_2\Big(\frac{B(x_1(t),\ldots,x_{m-1}(t),r)}{B(x_1(t),\ldots,x_m(t),r)}\Big) \tag{41}$$


[Figure 3: plot titled "lorenz x-y"; horizontal axis: lag, from -40 to 40.]

Fig. 3. Cross-redundancy between the $x$ and $y$ components of the Lorenz equations as a function
of time lag (solid curves). Top curve is for $r = 0.01\sigma$, next is for $r = 0.02\sigma$ and so on to $r = 0.1\sigma$.
Dashed line is for the linear cross-redundancy.

where $(x_1(t), x_2(t), \ldots, x_m(t))$ is a time delay embedding: $(x(t), x(t-\tau), \ldots, x(t-(m-1)\tau))$.
The local K-S entropy gives us a measure of the local predictability, without actually
doing nonlinear prediction or calculating local Lyapunov exponents [49-51] (footnote 11). If one is just
interested in determining how the degree of predictability changes across the state space,
we suggest that calculating the local entropy may be a numerically easier way to get this
information than using nonlinear prediction or local Lyapunov exponents.

As an example of the local K-S entropy we generate $N = 65536$ points of the $x$ and $z$
variables of the Rossler equations with a time step $\delta t = 0.314$. The local entropy is then
estimated in embedding dimensions $m = 3, \ldots, 8$ using the $x$ component, a time delay of
$\tau = 5$ (the first minimum of the mutual information) and $r = \sigma/4$, for the first 500 points
along the trajectory. (Since we are only interested in finding how the predictability changes
in different regions of state space, and not in the exact value of the local entropy, we do not
take the limits as $r \to 0$ and $m \to \infty$, but instead evaluate the approximate local entropy at
finite $r$ and $m$.) By examining the Rossler attractor it is clear that most of the stretching and
folding occurs when $z$ is large; therefore, we expect the local entropy to be large only when
$z$ is large. In Fig. 4 the approximate local entropy for embedding dimensions $m = 3, \ldots, 8$
is shown, as well as the $z$ component of the Rossler equations. Notice that the local entropy
has "spikes" when $z$ is large, as was expected. In the figure the curves are shifted to the right
by $\tau = 5$ for each increasing embedding dimension, because of the time delay embedding.
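The finite-$r$, finite-$m$ approximation used in this example can be sketched in Python as below. The function names are hypothetical, and one implementation choice is an assumption of the sketch: both neighbour counts are taken over the same set of candidate times, so that the normalization cancels in the ratio.

```python
import math


def neighbour_count(x, t, m, tau, r, start):
    """Number of times s >= start, s != t, whose m-dimensional delay vector
    lies within r (maximum norm) of the delay vector at time t."""
    return sum(
        1
        for s in range(start, len(x))
        if s != t
        and all(abs(x[s - k * tau] - x[t - k * tau]) < r for k in range(m))
    )


def local_entropy(x, t, m, tau, r):
    """Local Shannon entropy: -log2 of the neighbour fraction at time t."""
    start = (m - 1) * tau
    count = neighbour_count(x, t, m, tau, r, start)
    if count == 0:
        raise ValueError("no neighbours within r at this point")
    return -math.log2(count / (len(x) - start - 1))


def local_ks_entropy(x, t, m, tau, r):
    """Approximate local K-S entropy at fixed m and r (bits per time unit):
    log2 of the ratio of (m-1)- and m-dimensional neighbour counts, over tau."""
    start = (m - 1) * tau
    n_low = neighbour_count(x, t, m - 1, tau, r, start)
    n_high = neighbour_count(x, t, m, tau, r, start)
    if n_high == 0:
        raise ValueError("no m-dimensional neighbours within r")
    return math.log2(n_low / n_high) / tau


# A period-2 signal is perfectly predictable: every neighbour in m-1
# dimensions stays a neighbour in m dimensions, so the estimate is zero.
x = [0.0, 1.0] * 10
print(local_ks_entropy(x, 5, 2, 1, 0.5))
```

Evaluating `local_ks_entropy` for successive values of `t` along a trajectory gives a time series of approximate local entropies of the kind plotted in Fig. 4.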

Footnote 11: The local K-S entropy is related to the local Lyapunov exponents of Eckhardt [51]. The "local" exponents
of Abarbanel et al. [49,50] are in fact finite time Lyapunov exponents; that is, they are more like finite time
averages of Eq. (41).

Fig. 4. Top panel: $z$ component of the Rossler equations. Next 6 panels are the (approximate) local
entropy calculated from the $x$ component for embedding dimensions $m = 3, \ldots, 8$.

Minimizing mutual information is one criterion that might be used to get a good embedding;
in fact, this was suggested by Shaw and explored by Fraser [9]. Casdagli et al. [70]
suggest minimizing another quantity which they call the "conditional variance"



(42) 



For delay coordinates, $z = x(t+\tau)$ and $\mathbf{y} = (x(t), \ldots, x(t-(m-1)\tau))$. Casdagli et al. [70]
argue that conditional variance is a more appropriate criterion than mutual information.
Another benefit of the conditional variance is that it is a local measure. Cenys et al. [67,71]
have also proposed statistics based on the conditional variance.

Casdagli et al. also point out that the quality of the embedding depends on the coupling
between the variables; they illustrate this idea using the Lorenz equations:



$$\dot{x} = \sigma(y - x), \qquad \dot{y} = rx - xz - y, \qquad \dot{z} = xy - \beta z \tag{43}$$



They point out that when $x$ is small the coupling between $y$ and $z$ is weak, so in the presence
of any noise one expects a poor reconstruction when $x \approx 0$. This suggests that we might
want to look at a quantity like $i_1(z(t);y(t)|x(t))$ along a trajectory in the state space to
determine how coupling between $y$ and $z$ depends on the position in state space. We can
express $i_1(z(t);y(t)|x(t))$ in terms of the inner sum of the correlation integral ($B(\mathbf{x}(t),r)$):

$$i_1(z(t);y(t)|x(t)) = h_1(y(t),z(t),r) + h_1(x(t),y(t),r) - h_1(x(t),y(t),z(t),r) - h_1(x(t),r) = -\log_2\Big(\frac{B(y(t),z(t),r)\,B(x(t),y(t),r)}{B(x(t),y(t),z(t),r)\,B(x(t),r)}\Big) \tag{44}$$



As an example, we generate $N = 65536$ points of $x$, $y$, and $z$ from the Lorenz equations
with a time step of $\delta t = 0.01$. Using $r = 3.25$ (roughly 1/4 of the standard deviation),
$i_1(z(t);y(t)|x(t))$ is computed for the first 500 points along the trajectory. In Fig. 5 we show
$i_1(z(t);y(t)|x(t))$ versus $x$. Notice that $i_1(z(t);y(t)|x(t))$ is small near $x = 0$, since when $x$
is small the coupling between $y$ and $z$ is weak.




Fig. 5. The local mutual information between $z$ and $y$ given $x$, ($i_1(z;y|x)$), plotted against $x$ for
a short trajectory of the Lorenz equations. When $x$ is small the coupling between $y$ and $z$ is weak;
therefore, $i_1(z;y|x)$ is small.



8 Conclusions 

We have discussed the relationship between various information theory based quantities and
the generalized correlation integral of order 1, and extensions of these quantities based on the
generalized correlation integral. It has been shown that the correlation integral approach
has several advantages over box counting methods (especially for $q = 2$), and that the idea
of comparing the $\tau$ dependence of a statistic to its "linearized" version can be extended to all
the statistics based on the correlation integral. Finally, we have introduced new information
theoretic statistics, including "local" versions of several statistics based on the inner sum of
the correlation integral.